MetaGS: an accurate method to impute and combine SNP effects across populations using summary statistics

Background Meta-analysis describes a category of statistical methods that aim at combining the results of multiple studies to increase statistical power by exploiting summary statistics. Different industries that use genomic prediction do not share their raw data due to logistic or privacy restrictions, which can limit the size of their reference populations and creates a need for a practical meta-analysis method. Results We developed a meta-analysis, named MetaGS, that duplicates the results of multi-trait best linear unbiased prediction (mBLUP) analysis without accessing raw data. MetaGS exploits the correlations among different populations to produce more accurate population-specific single nucleotide polymorphism (SNP) effects. The method improves SNP effect estimations for a given population depending on its relations to other populations. MetaGS was tested on milk, fat and protein yield data of Australian Holstein and Jersey cattle and it generated very similar genomic estimated breeding values to those produced using the mBLUP method for all traits in both breeds. One of the major difficulties when combining SNP effects across populations is the use of different variants for the populations, which limits the applications of meta-analysis in practice. We solved this issue by developing a method to impute missing summary statistics without using raw data. Our results showed that imputing summary statistics can be done with high accuracy (r > 0.9) even when more than 70% of the SNPs were missing with a minimal effect on prediction accuracy. Conclusions We demonstrated that MetaGS can replace the mBLUP model when raw data cannot be shared, which can lead to more flexible collaborations compared to the single-trait BLUP model.


Background
The simultaneous estimation of the genomic effect on quantitative traits using all available variants, named genomic prediction [1], has revolutionized the fields of plant breeding [2], animal breeding [3] and personalized medicine [4]. This is because it considers the linkage disequilibrium (LD) between variants when they are all fitted in a single model instead of attributing the same variation to multiple variants in high LD when running single variant association analyses [5]. Complex traits are usually controlled by a large number of genes with small effects, which requires a very large population to train the prediction equation with any reliability [6]. Such large reference sets with genotyped and phenotyped individuals may not be available for all populations and/or traits which requires combining different populations in a single analysis. Ideally, a joint analysis using the raw phenotypic and genotypic data of different populations can be run to get more accurate predictions. However, restrictions on sharing raw data and privacy regulations limit such possibilities [7]. While sharing individual-level data is generally constrained, it is widely acceptable to share summary statistics that are descriptive for the prediction outcome and cannot be used to infer the original data. Such summary statistics can be combined in a data analysis frame called meta-analysis [8].
A meta-analysis refers to the inference of more robust outcomes by combining the results of multiple studies in a single analysis, in other words, the analysis of analyses. The term "meta-analysis" was first proposed by Glass [9]. Multi-trait across country evaluation (MACE) is a traditional example of a meta-analysis in animal breeding [10], which uses as input the estimated breeding values (EBV) of bulls calculated within each country rather than the bull's raw data. VanRaden and Sullivan [11] extended the MACE model to include genomic information (GMACE) which facilitated better prediction for young bulls with no daughter proofs [12]. GMACE has been successfully used for over a decade and it has a proven capability to increase the reliability of bull's genomic EBV (GEBV). However, an alternative approach is required to analyze data from other organisms.
To estimate the substitution effect of single nucleotide polymorphisms (SNPs) when raw data are not available, different meta-based models were proposed regardless of whether the variants were fitted independently as in genome-wide association studies (GWAS) or jointly as in genomic prediction studies [13][14][15][16]. Although these models resulted in higher prediction accuracy compared to all participant individual data, most of them showed a lower accuracy when compared to the joint analysis that uses individual-level data, as they depend on approximation [14]. The model proposed by Vandenplas et al. [16] was designed to exploit both raw data as well as summary statistics in the same analysis by fitting multiple records per trait and it showed minimal approximation compared to other models using simulated data. However, it assumes that the genetic correlation among all participant populations is equal to 1 which may not be the case for many populations/traits. Another drawback with these models is that they all require participant populations to be genotyped with the same set of SNPs, which limit their practical application in real life.
The objective of the present study was to develop a meta-analysis (named MetaGS) that mimics the multiple-trait best linear unbiased prediction (mBLUP) model using raw data with limited approximation. Our model assumes that each population would benefit from its correlation with other populations to gain more accurate population-specific SNP effects. Participant data holders are expected to share the LD matrices, the frequencies, and the effects of their SNPs as well as the variance of the direct genomic value and error variance in their populations. The model is sufficiently flexible to allow for imputing missing variants and their effects in different populations using the previous summary statistics without accessing the individual-based data. The model was tested on Australian Holstein and Jersey cattle populations for milk, fat and protein yields and it showed that it can duplicate the joint analysis that uses the raw data with no approximation.

Methods
Our multi-trait BLUP model assumes that the effects of a SNP in population i and j ( g i and g j ) are genetically correlated with the same correlation as the genetic correlation between true breeding values in different populations. Such a correlation can be inferred from other tests such as the correlation between estimated breeding values or the MACE. Within population i ( i = 1,…,c), the SNP effects are g i , where g i is a vector of SNP effects in population i.

SNP effect estimation in a single population
The input to the meta-analysis are SNP effects estimated within each population excluding foreign data. We assume that the input individual estimates of SNP effects for population i are estimated with a SNP BLUP model [17] that would be equivalent to: where y i is a vector of phenotypes (in our data, average of daughter phenotypes) of the training or reference population corrected for all effects except the additive genetic effects explained by the SNPs; µ i is the general mean of population i ; 1 is a vector of 1s; Z i represents the design matrix for the genotypes of reference individuals. Genotypic values of reference individuals take three possible values: 2 − 2p ij , 1 − 2p ij and 0 − 2p ij for genotypes AA, AB or BB, respectively [18], p ij is the allele frequency of SNP j ( j = 1, …, m ) of population i ; e i is a vector of residual effects for the reference population (e.g. in a sire genomic model) with a (co)variance matrix as follows: where σ 2 e i is the error variance of population i , and n ik is the effective number of daughters contributing to y ik of the reference individual k in population i . We tested the use of population-specific allele frequencies (which vary from one population to another) or a unified frequency (using pooled individuals from all populations) and we did not get any difference in the results (results not shown). Strandén and Christensen [19] showed that SNP-BLUP is invariant to the frequencies used for genotype centering if the model includes a mean which agrees with our results. (1) Under the SNP BLUP model [17], SNP effects are distributed as: with σ 2 i being the variance of the direct genomic values (DGV) of population i.
DGV represents the sum of all SNP effects: where DGV ik is the breeding value of individual k explained by SNPs; z ik is a row in the design matrix Z i corresponding to individual k . For this model, the mixed model equations for population i are:

(Co)variance of SNP effects in different populations
For the multi-trait BLUP model, SNP effects from different populations have the following (co)variance matrix: where σ 1c is the DGV covariance between population 1 and c. Similar to the definition of matrix B i for population i , matrix B i,c for the two populations relies on the assumption that the same set of SNPs is used in the two populations: The (co)variance matrix of the population SNP effects, Eq. (7), becomes: var     and its inverse matrix is:

Estimation of SNP effects in the meta-analysis model (MetaGS)
The effects of the meta-analysis model are estimated using the following mixed model equations: All the terms in Eq. (11) can be derived from the individual population analyses. Data holders need to submit the components that allow to build Eq. (11) which are for population i : (1) SNP effect estimates g i or the right-hand for a measure of prediction error (co)variances of the SNP effect estimates; (3) marker allele frequencies of a reference SNP allele such as allele A in the population; and (4) the variance of direct genomic values. All the participating data holders must code the two SNP alleles A and B in the same way (i.e. using a specific reference genome, homozygosity for the reference allele can be coded as 0, homozygosity for the alternative allele can be coded as 2, and heterozygosity can be coded as 1), so they end with equivalent g i estimations and Z i ′ R −1 i Z i matrices across populations. In the present paper, we assumed that there was no residual polygenic (RPG) effect. However, if an RPG effect exists, it can be fitted during the estimation of SNP effects for each individual population following Liu et al. [17] without affecting the meta-analysis equations. This was validated using commercial Brown Swiss bulls' data from six countries, but the (results not shown here).
It is not necessary for data holders to submit multiple Z i ′ R −1 i Z i matrices if they phenotyped different sets of individuals for different traits (assuming similar reliabilities across traits) as these matrices are supposed to be correlated. Instead, they are required to submit the number of reference individuals ( α ) used to generate Z i ′ R −1 i Z i as well as the number of phenotyped reference individuals (bulls in our case) for trait i ( n i ). The Z i ′ R −1 i Z i matrix can be rescaled with the number of phenotypes to avoid overestimating the magnitude of populations with missing phenotypes using the following equation:

Handling different sets of SNPs between populations
Here, we propose a method to account for different SNP datasets used by different populations. In this method, we expand the list of SNPs to include all SNPs used by any of the participating populations. Equation (6) shows how the right-hand sides (RHS) of the mixed model equations for each population can be obtained from the left-hand sides (LHS) that the population provides i.e. the design matrix and the SNP solutions. However, these RHS are missing for SNPs that are not used by that population, so we impute the missing RHS as follows. We assume that, due to LD among the SNPs, the genotypes for the complete set of SNPs ( Z c ) on the bulls used by population i are related to the genotypes for non-missing SNPs ( Z i ) by This gives all the necessary inputs for Eq. (11). Therefore, the method does not directly impute the input SNPs but instead imputes the underlying matrices that are adequate to estimate SNP effects using the MetaGS model based on the LD between existing and missing SNPs.
After the complete equations have been solved and yield prediction equations for each population based on the complete SNP set, the solutions for the original SNP set of population i can be obtained by: where g c are the SNP solutions for population i based on the complete SNP set and g i are the solutions for the original SNP set of population i . In the validation analysis of this paper, we calculated multiple submatrices for T for every 200 adjacent SNPs on the same chromosome to make the matrix invertible. Therefore, the full T matrix became a block-diagonal matrix.

Validation data
The data involved Australian Holstein (4627 bulls) and Jersey (1178 bulls The bivariate model assumes that the phenotypes of the two breeds for the same trait are two different correlated traits. The genetic correlation between Holstein and Jersey bulls was estimated using MTG2 and fitted in the meta-analysis to ensure that the same G matrix was built with both models. Correlations were 0.54, 0.36 and 0.33 for milk, fat and protein yields, respectively. The prediction accuracy was inferred from correlations between DGV and the phenotypes of the validation population. To test the accuracy of rescaling the Z ′ R −1 Z matrix, Eq. (12) was applied after masking different proportions (ranging from 0.01 to 0.95 with a 0.01 increment) of the Holstein and Jersey populations. The accuracy was inferred from the correlation between the rescaled Z ′ R −1 Z matrices and the original Z ′ R −1 Z matrix that was calculated using all the sires. This process was replicated 100 times to calculate the confidence interval for the accuracy.
To test the accuracy of imputing the Z ′ R −1 Z matrix and the Z ′ R −1 y and g vectors, each of the Holstein or Jersey populations were divided into three equal sets (Fig. 1). One third was randomly selected as a reference to build the T matrix (Eq. (13)). Another third of the remaining bulls was used to validate the imputation accuracy. Masking different proportions (ranging from 0.01 to 0.95 with a 0.05 increment) of the SNPs was randomly applied to the validation bulls to be imputed. The accuracy was inferred from the correlation between the imputed SNPs in the three matrices or vectors ( Z ′ R −1 Z , Z ′ R −1 y and g ) and their values in the original three matrices or vectors (16) calculated using all the SNPs. This process was replicated 100 times to calculate the confidence interval for the accuracy. The population used to validate the imputation method was also used as a training population to run the MetaGS model using the imputed matrices (all masking proportions and replicates). The last third of the Holstein and Jersey populations was used to calculate and compare the phenotype prediction accuracies using SNP effects calculated for all SNP masking scenarios using two types of inputs: Z ′ R −1 Z plus g or Z ′ R −1 Z plus Z ′ R −1 y . Here, the accuracy was inferred from the correlations between DGV and masked phenotypes.

Results
The meta-analysis developed here (MetaGS) was compared to the single trait (ST) analysis as well as the joint analysis (mBLUP) that exploits raw data for both Australian Holstein (HOL) and Jersey (JER) cattle breeds for three traits. Compared to the ST model, the average correlation across traits between SNP effects produced by MetaGS and ST was 0.97 for HOL and 0.76 for JER (data not shown). MetaGS performed almost the same as mBLUP and they produced highly correlated SNP effects with correlation coefficient values ranging from 0.98 to 0.99. The correlation was even higher (r = 1) when comparing the GEBV for all traits in both HOL and JER populations (Fig. 2).
Testing the three models on the validation sets showed comparable prediction accuracies for the three traits when predicting the GEBV of a breed using the SNP effects that were trained on the same reference breed. The average prediction accuracies for both breeds were 0.487, 0.498 and 0.503 for the ST, mBLUP and MetaGS models, respectively (Table 1). However, when performing across-breed prediction (i.e. predicting HOL from JER SNP effects and via versa), the mBLUP and MetaGS models showed superiority compared to the ST model. Predicting the JER validation set using the HOL SNP effects of the ST model had an average prediction accuracy of only 0.043, while the value was equal to 0.397 and 0.387 for the mBLUP and MetaGS models, respectively. Similarly, predicting the HOL validation set using the JER SNP effects had average prediction accuracies of 0.217, 0.403 and 0.42 for the ST, mBLUP and MetaGS models, respectively (Table 1).
Rescaling the Z ′ R −1 Z matrix using Eq. (12) has a practical advantage to avoid sharing multiple Z ′ R −1 Z matrices if the meta-analysis study was planned for multiple traits measured on different overlapping sets of individuals within the same population. Figure 3a shows the accuracy and standard deviation of rescaling Z ′ R −1 Z for the HOL and JER populations when masking a proportion of all individuals (between 1 and 95%). The method preserved a high accuracy (> 0.9 for JER and > 0.95 for HOL) even when more than 50% of the population was masked. Moreover, the method did not magnify or shrink the rescaled Z ′ R −1 Z compared to the original one since the regression slope always had an average close to 1 with small 95% confidence intervals of 0.01 for HOL and 0.02 for JER when masking 50% of the population (Fig. 3b). Different data holders might use different sets of SNPs. Even when different populations were genotyped with the same genotyping platform or chip, they may have filtered them differently. Thus, we proposed a method to impute missing variants in the Z ′ R −1 Z matrix and the Z ′ R −1 y and g vectors without accessing the raw data using an independent reference population. The accuracy of imputing the g vector was low even when only 1% of the SNPs were masked (0.81 for HOL and 0.63 for JER; Fig. 4). However, the method imputed the Z ′ R −1 Z matrix and the Z ′ R −1 y vector with high accuracy even at high SNP masking rates. The left-hand side matrix had an accuracy higher than 0.92 when up to 70% of the SNPs were masked while the right-hand side had an accuracy higher than 0.9 when up to 90% of the SNPs were masked in both HOL and JER breeds (Fig. 4). However, the low accuracy of predicting the g vector did not affect the final prediction accuracy results, regardless of whether we used the g vector or the Z ′ R −1 y vector in the MetaGS analysis (Fig. 5). Our results clearly demonstrated no changes in DGV prediction accuracy when using imputed vectors ( g or Z ′ R −1 y ) with a masking proportion of up to 70% compared to the original scenario that used the complete dataset, i.e. the masking  Table 1 Prediction accuracy for milk, fat and protein yields of sires with phenotyped daughters, using single trait analysis (ST), multitrait analysis (mBLUP) and meta-analysis (MetaGS) The values represent the prediction accuracy (as the correlation coefficient between DGV and the phenotypes of the validation population) for the validation population (rows) using SNP effects calculated on the reference population (columns)  proportion is equal to zero for all traits in both Holstein and Jersey populations (Fig. 5).

Accuracy of MetaGS
We demonstrated that the MetaGS model can duplicate the estimation of SNP effects and prediction results of the mBLUP model. The outcomes of both models were highly correlated in all tested scenarios. For the Holstein population, SNP effects of both models were highly correlated with the ST model (r = 0.97) unlike the Jersey population which had an average correlation coefficient of 0.76. This is mainly due to the large reference population for the Holstein breed (4105 bulls) compared to the Jersey breed (1071 bulls), which limited the effect of the Jersey's input on the Holstein SNP effects. On the other hand, the smaller Jersey population benefits more from its correlation with the larger Holstein population. Applying SNP effects on the validation population showed that the mBLUP and MetaGS models had no accuracy gain over the ST model when predicting a breed's performance using the SNP effects of the same breed. This result might be specific to our data as previous studies reported higher accuracy of multi-trait models over single-trait models [16,21]. Anyway, it is outside the scope of the current paper to demonstrate the advantage of mBLUP or MetaGS over the ST model since MetaGS was developed to reproduce mBLUP results without accessing raw data. However, mBLUP and MetaGS had on average nine times and two times higher accuracy for the Jersey and Holstein breeds, respectively, compared to the ST model when predicting the performance of one breed using the SNP effects of the other breed. This is expected given that mBLUP and MetaGS use the information of different breeds. Different breeds usually have different LD structures [22]. Consequently, 50k SNPs is not a sufficiently dense SNP panel so that the correlation between SNPs and causal variants is the same in Holstein and Jersey.
Unlike other models that produce a global effect value per SNP [14,16], MetaGS calculates populationspecific SNP effects considering variation in LD structures and genetic correlation among populations. For this reason, the model can fit complex traits with high genotype × environment interactions without reducing their accuracies. Even in the most extreme scenario in which different populations were completely uncorrelated, i.e. the genetic correlation coefficients among populations were equal to zero, the resulting SNP effects after running MetaGS will be equal to the input population-specific effects obtained from the ST analysis, assuming no imputation is required.

Solving practical difficulties
The two major issues that limit the application of metaanalyses to combine SNP effects of different genomic prediction studies are the differences in analyzed variants and the size of the required summary statistic files that need to be shared. The framework of the MetaGS method is flexible enough to solve these issues, which makes it more acceptable from a practical point of view.
Rescaling the Z ′ R −1 Z matrix For different measured traits on the same population, usually the number of phenotyped individuals varies slightly, while other traits are measured on a small proportion of the population. For this reason, each trait should have its own Z ′ R −1 Z matrix. The dimension of this matrix is m × m , where m is the number of SNPs, making it a huge matrix for sharing, especially if the meta-analysis was planned for multiple traits. Our results showed that there is no need to share multiple Z ′ R −1 Z matrices for the same population as these can be accurately predicted from one to another using Eq. (12) even when more than half of the population had no phenotypic records. The Z ′ R −1 Z matrix represents the LD structure in the population, which makes it redundant for different traits except for a difference in scale that is inferred from the number of phenotyped individuals. For this reason, it is recommended that each data holder should calculate a single Z ′ R −1 Z matrix using all the genotyped individuals even if they do not have phenotypes and this can be rescaled for each trait depending on the number of individuals having phenotypic records. However, sharing Z ′ R −1 Z matrices may still be required for traits with very small reference populations or deviated R −1 patterns. As this matrix is symmetric, data holders need to share only the upper (or lower) triangle with the diagonal. This information can be saved in files with binary format to further reduce the size of the transferred materials.

Imputing summary statistics for missing variants
MetaGS provided a comprehensive mathematical frame to synchronize variants in different datasets by imputing them from a reference population without accessing the raw data using the T matrix. We showed that our imputation method had a minimal effect on prediction accuracy even at the high SNP masking rate of 70%. Calculating the T matrix requires inverting the Z ′ i Z i matrix and in order to make it invertible, the number of variants must be smaller than the number of individuals in the reference population. While it is impossible to get such a large reference population for most organisms, we recommend applying Eq. (13) within each LD block, separately, and setting all off-diagonal or inter LD block elements to zero.
In our analysis, we calculated the T matrix for each 200 adjacent SNPs, but the accuracy was not affected when using different numbers of SNPs per LD block (data not shown). In genomic prediction, the effect of a causal variant can be distributed over multiple variants in high LD with it since they are all fitted together in one model [23], unlike in GWAS in which overlapping sources of variation can be attributed to multiple variants fitted independently [5]. For this reason, unsynchronized variants can be easily filtered out in GWAS as they were fitted independently but this cannot be done in genomic prediction. This can explain why the accuracy of imputing the g vector was relatively low compared to that of the Z ′ R −1 Z matrix and Z ′ R −1 y vector even at very low SNP masking rates such as 1% (Fig. 4). Different variants in high LD can have variable effect values but they would end with comparable values on the right-hand side (or the Z ′ R −1 y vector) when they are multiplied with the allelic dosage for each individual or the Z ′ R −1 Z matrix. This can explain why the prediction accuracy did not change when using the Z ′ R −1 y vector compared to the analysis that used the g vector as input (Fig. 5). It is worth noting that the accuracies in Fig. 5 are much larger than those in Table 1 given that in the former, one random third of the population was used for validation, while in the latter, young bulls that were born after 2010 were used for validation. Therefore, the reference and validation populations in the former were more related.
While our results demonstrated that missing variants can be imputed with high accuracy even at high variant masking rates, it is worth noting that the imputation accuracy depends on the relatedness between the tested population and the reference populations, like any other imputation method [24]. The reference population assumed here is expected to be genotyped with all the variants used across studies. One option to collect such a population, if it is not available, is to select a subset of representative individuals within each population to be genotyped with the full list of variants that are planned to be used in the meta-analysis. These individuals can be used within each population to impute other individuals with any imputation algorithm such as FImpute [24] or Minimac [25]. Another option is to have an agreement among all data holders to share the genotyping of a few random individuals (e.g. 100 or 200 individuals) without sharing their ID or phenotypes. Participants can share a single random phased haplotype per individual so the actual genotype cannot be revealed, which allows them to share large numbers of haplotypes without any risk. Any imputation algorithm can then be applied after gathering all the individuals to fill the missing variants across